Common Cross Validation Test Code

We used the same cross validation test procedure for the three applications described in the paper. This document provides explanations for the code in analytics.py used in those tests.

See the tests carried out in each application:

Lists of features

In our experiments, we first test our trained classifiers using all 22 provenance network metrics as defined in the paper. We then repeat the test using only the generic network metrics (6) and only the provenance-specific network metrics (16). Comparing the performance from all three tests will help verify whether the provenance-specific network metrics bring added benefits to the classification application being discussed.

The three lists of metrics, combined, generic, and provenance, are defined below.


In [1]:
# The 'combined' list has all the 22 metrics
feature_names_combined = (
    'entities', 'agents', 'activities',  # PROV types (for nodes)
    'nodes', 'edges', 'diameter', 'assortativity',  # standard metrics
    'acc', 'acc_e', 'acc_a', 'acc_ag',  # average clustering coefficients
    'mfd_e_e', 'mfd_e_a', 'mfd_e_ag',  # MFDs
    'mfd_a_e', 'mfd_a_a', 'mfd_a_ag',
    'mfd_ag_e', 'mfd_ag_a', 'mfd_ag_ag',
    'mfd_der',  # MFD derivations
    'powerlaw_alpha'  # Power Law
)
# The 'generic' list has 6 generic network metrics (that do not take provenance information into account)
feature_names_generic = (
    'nodes', 'edges', 'diameter', 'assortativity',  # standard metrics
    'acc',
    'powerlaw_alpha'  # Power Law
)
# The 'provenance' list has 16 provenance-specific network metrics
feature_names_provenance = (
    'entities', 'agents', 'activities',  # PROV types (for nodes)
    'acc_e', 'acc_a', 'acc_ag',  # average clustering coefficients
    'mfd_e_e', 'mfd_e_a', 'mfd_e_ag',  # MFDs
    'mfd_a_e', 'mfd_a_a', 'mfd_a_ag',
    'mfd_ag_e', 'mfd_ag_a', 'mfd_ag_ag',
    'mfd_der',  # MFD derivations
)
# The utility of the above three sets of metrics will be assessed in our experiments to
# understand whether provenance type information helps us improve data classification performance
feature_name_lists = (
    ('combined', feature_names_combined),
    ('generic', feature_names_generic),
    ('provenance', feature_names_provenance)
)
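
As a quick illustration (not part of analytics.py), assuming a hypothetical metrics DataFrame df with one column per metric plus a label column, a feature subset can be selected by indexing with one of the lists above:

# Hypothetical sketch: `df` holds one column per metric plus a 'label' column
X_generic = df[list(feature_names_generic)]        # 6 generic metrics only
X_provenance = df[list(feature_names_provenance)]  # 16 provenance-specific metrics only
X_combined = df[list(feature_names_combined)]      # all 22 metrics
Y = df.label                                       # classification labels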

Balancing Data

This section defines the data balancing function by over-sampling using the SMOTE algorithm (see SMOTE: Synthetic Minority Over-sampling Technique).

It takes a dataframe where each row contains the label (in the label column) and the feature vector corresponding to that label. It returns a new dataframe of the same format, but with extra rows resulting from the SMOTE oversampling process.


In [2]:
import pandas as pd  # needed to rebuild the balanced DataFrame below
from imblearn.over_sampling import SMOTE
from collections import Counter

def balance_smote(df):
    X = df.drop('label', axis=1)
    Y = df.label
    print('Original data shapes:', X.shape, Y.shape)
    
    smoX, smoY = X, Y
    c = Counter(smoY)
    while min(c.values()) < max(c.values()):  # check if all classes are balanced; if not, balance the first minority class
        # note: the ratio/kind arguments and fit_sample() follow the older imbalanced-learn API
        # used by analytics.py; newer releases use sampling_strategy= and fit_resample() instead
        smote = SMOTE(ratio="auto", kind='regular')
        smoX, smoY = smote.fit_sample(smoX, smoY)
        c = Counter(smoY)
    
    print('Balanced data shapes:', smoX.shape, smoY.shape)
    df_balanced = pd.DataFrame(smoX, columns=X.columns)
    df_balanced['label'] = smoY
    return df_balanced
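
A minimal usage sketch, with a small synthetic DataFrame standing in for a real metrics dataset (the column names and values below are purely illustrative):

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df_toy = pd.DataFrame(rng.rand(32, 2), columns=['nodes', 'edges'])  # 32 random feature rows
df_toy['label'] = ['A'] * 24 + ['B'] * 8     # imbalanced labels: 24 'A' vs 8 'B'
df_toy_balanced = balance_smote(df_toy)
print(df_toy_balanced.label.value_counts())  # both classes now have 24 rows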

The t_confidence_interval function below calculates the 95% confidence interval for a given list of values.


In [3]:
import numpy as np
from scipy import stats

def t_confidence_interval(an_array, alpha=0.95):
    s = np.std(an_array)
    n = len(an_array)
    return stats.t.interval(alpha=alpha, df=(n - 1), scale=(s / np.sqrt(n)))
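
For example, with purely illustrative accuracy values, the function returns a (lower, upper) pair centred on zero; the upper bound serves as the ± half-width when reporting mean accuracy, as in the cv_test function below:

scores = [0.91, 0.88, 0.93, 0.90, 0.89]  # illustrative accuracy values only
lower, upper = t_confidence_interval(scores)
print("%.2f%% ±%.4f" % (np.mean(scores) * 100, upper * 100))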

Cross Validation Methodology

The following cv_test function repeats the cross validation test until it has collected at least n_iterations accuracy scores, and returns those scores along with the feature importance scores from each trained classifier. The cross validation steps are as follows:

  • Split the input dataset (X, Y) into a training set and a test set using Stratified K-fold method with k = 10
  • Train the Decision Tree classifier clf using the training set
  • Score the accuracy of the classifier clf on the test set
  • (Repeat the above until the required number of accuracy scores has been collected)

In [4]:
from sklearn import model_selection, tree

def cv_test(X, Y, n_iterations=1000, test_id=""):
    accuracies = []
    importances = []
    while len(accuracies) < n_iterations:
        skf = model_selection.StratifiedKFold(n_splits=10, shuffle=True)
        for train, test in skf.split(X, Y):
            clf = tree.DecisionTreeClassifier()
            clf.fit(X.iloc[train], Y.iloc[train])
            accuracies.append(clf.score(X.iloc[test], Y.iloc[test]))
            importances.append(clf.feature_importances_)
    print("Accuracy: %.2f%% ±%.4f <-- %s" % (np.mean(accuracies) * 100, t_confidence_interval(accuracies)[1] * 100, test_id))
    return accuracies, importances
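
A usage sketch, again assuming a hypothetical metrics DataFrame df as above (the reduced iteration count is only to keep the example quick):

X = df[list(feature_names_combined)]  # all 22 metrics
Y = df.label
accuracies, importances = cv_test(X, Y, n_iterations=100, test_id="combined")
print(len(accuracies), len(importances))  # 100 scores, collected in batches of 10 folds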

Experiments: Having defined the cross validation method above, we now run it on the dataset (df) using all the features (combined), only the generic network metrics (generic), and only the provenance-specific network metrics (provenance).


In [5]:
def test_classification(df, n_iterations=1000):
    results = pd.DataFrame()
    imps = pd.DataFrame()
    Y = df.label
    for feature_list_name, feature_names in feature_name_lists:
        X = df[list(feature_names)]
        accuracies, importances = cv_test(X, Y, n_iterations, test_id=feature_list_name)
        rs = pd.DataFrame(
            {
                'Metrics': feature_list_name,
                'Accuracy': accuracies}
        )
        results = results.append(rs, ignore_index=True)  # note: newer pandas removed DataFrame.append; use pd.concat([results, rs], ignore_index=True) instead
        if feature_list_name == "combined":  # we are interested in the relevance of all features (i.e. 'combined') 
            imps = pd.DataFrame(importances, columns=feature_names)
    return results, imps

In summary, the test_classification() function above takes a DataFrame with a special label column holding the labels for the intended classification. It runs the cross validation test three times:

  1. using all 22 network metrics available (in the remaining columns of the DataFrame),
  2. using only generic network metrics, and
  3. using only provenance-specific network metrics.

The accuracy measures from those tests (1,000 values from each) are collated in the returned results DataFrame. The importance measures of all 22 metrics calculated in test (1) are also collated and returned in the imps DataFrame.
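
For instance, the returned DataFrames can be summarised as follows (a sketch only; df again stands for a hypothetical metrics DataFrame from one of the applications):

results, imps = test_classification(df, n_iterations=1000)
print(results.groupby('Metrics').Accuracy.mean())        # mean accuracy per feature list
print(imps.mean().sort_values(ascending=False).head())   # most relevant metrics in the 'combined' test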